Skip to content

Conversation

@Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Oct 16, 2025

Which issue does this PR close?

Rationale for this change

When reviewing #17085 I was very confused by the fix suggested, and tried to understand why AccumulatorArgs didn't have easy access to Fields of its input expressions, as compared to scalar/window functions which do. Introducing this new field should make it easier for users to grab datatype, metadata, nullability of their input expressions for aggregate functions.

What changes are included in this PR?

Add a slice of FieldRef to AccumulatorArgs so users don't need to compute the input expression fields themselves via using schema. This addresses #16997 as it was confusing to have only the schema available as there are valid (?) cases where the schema is empty (such as literal only input).

This fix differs from #17085 in that it doesn't special case for when there is literal only input; it leaves the physical schema provided to AccumulatorArgs untouched but provides a more ergonomic (and less confusing) API for users to retrieve Fields of their input arguments.

  • I'm still not sure if the schema being empty for literal only inputs is correct or not, so this might be considered a side step. If we could remove schema entirely from AccumulatorArgs maybe we wouldn't need to worry about this, but see my comment for why that wasn't done in this PR

Are these changes tested?

Existing unit tests.

Are there any user-facing changes?

Yes, new field to AccumulatorArgs which is publicly exposed (with all it's fields).

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate functions Changes to functions implementation ffi Changes to the ffi crate labels Oct 16, 2025
Copy link
Contributor Author

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to remove schema entirely, to follow what #11725 aims for. However there are some usages that I couldn't trivially fix:

let ordering_dtypes = ordering
.iter()
.map(|e| e.expr.data_type(acc_args.schema))
.collect::<Result<Vec<_>>>()?;

let ordering_dtypes = ordering
.iter()
.map(|e| e.expr.data_type(acc_args.schema))
.collect::<Result<Vec<_>>>()?;

let ordering_dtypes = ordering
.iter()
.map(|e| e.expr.data_type(acc_args.schema))
.collect::<Result<Vec<_>>>()?;

Not to mention it might be more breaking to remove it (we could deprecate it I guess).

Comment on lines +72 to +73
/// Fields corresponding to each expr (same order & length).
pub expr_fields: &'a [FieldRef],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main change is here

Comment on lines +234 to +237
let arg_fields = args
.iter()
.map(|e| e.return_field(schema.as_ref()))
.collect::<Result<Vec<_>>>()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how we construct the eventual expr_fields; we're essentially doing this e.return_field(schema) pattern up front for the user, instead of requiring them to do it each time they need the field (see the various other fixes in this PR which is replacing those accesses with a simplified version)

@Jefffrey
Copy link
Contributor Author

fyi @kosiew

I tried implementing like this and it seems like no issues with regressions, thoughts on if this fix is simpler?

@Jefffrey Jefffrey marked this pull request as ready for review October 16, 2025 15:47
@Jefffrey Jefffrey added the api change Changes the API exposed to users of the crate label Oct 16, 2025
@kosiew
Copy link
Contributor

kosiew commented Oct 17, 2025

@Jefffrey

Your approach is an improvement!

✅ Simpler implementation - straightforward addition of pre-computed fields
✅ Less cognitive overhead - users don't need to understand schema synthesis
✅ Consistent API - same approach works for all cases (columns, literals, mixed)
✅ Performance - fields computed once during construction, not per-accumulator
✅ Explicit contract - makes it clear that field metadata is always available

👍 👍 👍

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @Jefffrey ,

I think you missed these:

https://github.com/apache/datafusion/blob/9bfa2ae/datafusion/physical-expr/src/aggregate.rs#L616

When AggregateFunctionExpr::with_new_expressions rewrites an aggregate to use new argument expressions, it clones the prior arg_fields. This means the expr_fields handed to the accumulator can carry metadata from the old expressions instead of the newly supplied ones, defeating the purpose of the added field (e.g., extension metadata will never update for rewritten expressions).


https://github.com/apache/datafusion/blob/9bfa2ae/datafusion/functions-aggregate/src/string_agg.rs#L194-L203

https://github.com/apache/datafusion/blob/9bfa2ae/datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs#L204-L224

Both approx_percentile_cont_with_weight and the ordered/distinct branch of string_agg build nested AccumulatorArgs by filtering acc_args.exprs, but they reuse the original expr_fields slice unchanged. After the filtering, the positions (and lengths) of exprs and expr_fields no longer match, so downstream code will read the wrong field metadata once it looks beyond the first argument.

@Jefffrey
Copy link
Contributor Author

hi @Jefffrey ,

I think you missed these: https://github.com/Jefffrey/datafusion/blob/acc_args_input_fields/datafusion/physical-expr/src/aggregate.rs#L625-L666

When AggregateFunctionExpr::with_new_expressions rewrites an aggregate to use new argument expressions, it clones the prior arg_fields. This means the expr_fields handed to the accumulator can carry metadata from the old expressions instead of the newly supplied ones, defeating the purpose of the added field (e.g., extension metadata will never update for rewritten expressions).

https://github.com/Jefffrey/datafusion/blob/acc_args_input_fields/datafusion/functions-aggregate/src/string_agg.rs#L194-L203

https://github.com/Jefffrey/datafusion/blob/acc_args_input_fields/datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs#L204-L224

Both approx_percentile_cont_with_weight and the ordered/distinct branch of string_agg build nested AccumulatorArgs by filtering acc_args.exprs, but they reuse the original expr_fields slice unchanged. After the filtering, the positions (and lengths) of exprs and expr_fields no longer match, so downstream code will read the wrong field metadata once it looks beyond the first argument.

Thanks for picking up on this; I actually did try doing the change inside AggregateFunctionExpr::with_new_expressions but ran into some issues with async UDFs. I forgot to consider the metadata would change too. I'll look into this and see if I can understand what's going on.

@Jefffrey Jefffrey marked this pull request as draft October 17, 2025 13:25
@Jefffrey
Copy link
Contributor Author

1c87652 fixes the AccumularArgs reconstruction for approx_percentile_cont_with_weight and string_agg.

Still looking into AggregateFunctionExpr::with_new_expressions

@Jefffrey Jefffrey marked this pull request as ready for review October 18, 2025 06:32
@Jefffrey
Copy link
Contributor Author

I think there is a wider issue with with_new_expressions(); I've raised #18149 for it and left a comment in this PR, hopefully that is sufficient @kosiew ?

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pub struct ForeignAccumulatorArgs {
pub return_field: FieldRef,
pub schema: Schema,
pub expr_fields: Vec<FieldRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be a breaking FFI change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, since FFI_AccumulatorArgs seems to be the one marked as being stable across FFI boundaries:

/// A stable struct for sharing [`AccumulatorArgs`] across FFI boundaries.
/// For an explanation of each field, see the corresponding field
/// defined in [`AccumulatorArgs`].
#[repr(C)]
#[derive(Debug, StableAbi)]
#[allow(non_camel_case_types)]
pub struct FFI_AccumulatorArgs {
return_field: WrappedSchema,
schema: WrappedSchema,
is_reversed: bool,
name: RString,
physical_expr_def: RVec<u8>,
}

Though I am not familiar with the FFI related code.

Copy link
Contributor

@alamb alamb Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @timsaucer -- perhaps you can confirm this doesn't mess up the FFI API

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate core Core DataFusion crate ffi Changes to the ffi crate functions Changes to functions implementation physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AccumulatorArgs.schema is empty when passing in scalar input

3 participants